Genome-wide association studies (GWAS) have become 
popular as an approach for the identification of large numbers 
of phenotype-associated variants. However, differences in 
genetic architecture and environmental factors mean that the 
effect of variants can vary across populations. Understanding 
population genetic diversity is valuable for the investigation of 
possible population-specific and population-independent effects of 
variants. EvoSNP-DB aims to provide information regarding 
genetic diversity among East Asian populations, including 
Chinese, Japanese, and Korean. Non-redundant SNPs (1.6 
million) were genotyped in 54 Korean trios (162 samples) and 
were compared with 4 million SNPs from HapMap phase II 
populations. EvoSNP-DB provides two user interfaces for data 
query and visualization, and integrates scores of genetic 
diversity (Fst and VarLD) at the level of SNPs, genes, and 
chromosome regions. EvoSNP-DB is a web-based application 
that allows users to navigate and visualize measurements of 
population genetic differences in an interactive manner, and is 
available online at [http://biomi.cdc.go.kr/EvoSNP/]. [BMB 
Reports 2013; 46(8): 416-421] 



INTRODUCTION 

Recent developments in high-throughput SNP chip technologies 
have enabled researchers to conduct large-scale genome-wide 
association studies (GWAS) (1-6). These have revealed an 
unprecedented number of genetic variants that are 
associated with complex traits (7). As of August 2012, there 
were 1,330 publications and 6,848 phenotype-associated 
SNPs in the NHGRI GWAS catalogue 
(http://www.genome.gov/gwastudies/). The availability of plentiful 
phenotype-related genomic information is expected to lead to 
clinically applicable personal genomics in the near future (8, 9). 
However, a number of issues require attention before this can be 
widely applied. Firstly, the identification of causal variants and a 
functional investigation of known loci are required (10, 11). 
GWAS have localized associated signals to specific genomic 
regions; however, most identified variants are located within 
intergenic, intronic, and gene desert regions, and are regarded 
as proxies for causal variants. Further analysis, such as fine 
mapping and resequencing, is required to unveil causal var- 
iants of phenotypes. Only a small number of genes in close 
proximity to associated variants have been examined to identi- 
fy possible functional relationships with phenotypes. Secondly, 
the majority of GWAS have been conducted on populations 
with European ancestry. Findings derived from European 
populations should be validated before they are applied to other 
ethnicities, such as those of Asian or African ancestry. Although 
some recent 
GWAS have been conducted on ethnic groups other than 
Europeans, sample sizes and numbers of target phenotypes 
have been relatively small compared with studies of Europeans 
(2, 6, 12). It is important to consider population-specific 
associations for personal genomics applications, as phenotype 
associations often vary across populations (3-5). 

Population-specific or population-independent associations of variants 
can be identified by GWAS in a specific population, or by in- 
dependent replication studies. However, these approaches re- 
quire a large number of samples, compounding the high costs 
associated with genotyping. As an alternative, the genetic di- 
versity of phenotype-associated regions may be examined. 
Large differences in genetic architecture among populations 
are an established cause of discrepancies in associations (3-5). 
The fixation index (Fst) is one of the most widely used metrics 
for measuring genetic differentiation between populations (13). 
Variation in linkage disequilibrium (VarLD) is another 
approach that measures population differences in LD patterns (14). 
Various web resources have been developed to provide 
graphical user interfaces (GUIs) and browsers for accessing 
genetic variation data, including Haplotter, FstSNP-HapMap3, 
SNP@Evolution, and the Singapore Genome Variation 
Project (SGVP) (15-18). Three of these use only reference 
information, such as data from HapMap phase 2 and phase 3 
(19, 20). SGVP also provides information derived from three 
Southeast Asian populations, as compared to HapMap 
populations (18). 

Genetic diversity among East Asian populations has not 
previously been provided via a web service. In particular, the 
Korean population is one of the most intensely studied in East 
Asia, but there is no web resource providing genetic diversity 
data that include Koreans. Although two populations that are 
geographically close to Korea (Han Chinese in Beijing (CHB) and 
Japanese in Tokyo (JPT)) are included in HapMap, He 
et al. reported that HapMap Asian samples (CHB and JPT) 
should not be regarded as references for Koreans (21). We 
therefore developed EvoSNP-DB, a web resource for genetic 
diversity among East Asian populations. 

RESULTS AND DISCUSSION 

We constructed EvoSNP-DB by integrating GBrowse and geno- 
type data from 108 Koreans (founders) and 210 HapMap 
phase II release #22 samples. After quality control, 1,147,845 
SNPs overlapped across Korean and other HapMap populations. 
The EvoSNP-DB database and web server are implemented 
on a 24x2.66 GHz Xeon core server running Red Hat 
Enterprise Linux (version 5.2), Apache (version 2.0), Tomcat 
(version 5.5), and MySQL (version 5.5). It is viewable in all 
major web browsers and operating systems, and is available 
online at [http://biomi.cdc.go.kr/EvoSNP/]. 

Database design and organization 

Data flow through the application is described in Fig. 1. 
Briefly, genotype data were analyzed to calculate Fst, VarLD, 
allele frequency, and Hardy-Weinberg equilibrium (HWE). 
Processed data are stored in the database with annotation 
information retrieved from UCSC and OMIM. The database is 
wrapped by GBrowse and JSP for data query and visualization 
interfaces. 

Fig. 1. Flow diagram of EvoSNP-DB construction. 

Genotype datasets are derived from the International 
HapMap Phase II release #22 data repository (11, 12), 
including data from 60 Utah residents with ancestry from 
Northern and Western Europe (CEU), 45 Han Chinese in 
Beijing (CHB), 45 Japanese in Tokyo (JPT), and 60 Yoruba in 
Ibadan, Nigeria (YRI). Considering the relatively small num- 
ber of samples of CHB and JPT, we pooled the data of both as 
a single geographical group, and denoted it as ASN (Asian, 90 
samples). The SNP information from 108 Korean founders of 
54 trios was compared to those of HapMap populations. The 
database has been integrated with Fst and VarLD metrics to 
facilitate the graphical representation of the data. Fst measures 
polymorphism within each population and differentiation 
among geographical groups (13). To quantify variation in pop- 
ulation linkage disequilibrium patterns, we used the varLD 
program (14). HapMap, UCSC, OMIM, and the NHGRI 
GWAS catalogue were major sources of annotation infor- 
mation. 
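
As an illustration of how a per-SNP differentiation score of this kind can be obtained, the following minimal sketch computes a heterozygosity-based Fst, (H_T - H_S) / H_T, from allele frequencies for every pair of populations. It is not necessarily the estimator used in EvoSNP-DB (reference 13 should be consulted for that); the function names, the sample-size weights, the label KOR for the Korean samples, and the example frequencies are all assumptions made for illustration.

```python
from itertools import combinations


def fst_two_populations(p1, p2, n1=1, n2=1):
    """Heterozygosity-based Fst for one biallelic SNP.

    p1, p2: frequency of the same allele in each population.
    n1, n2: optional sample-size weights (hypothetical parameters).
    """
    w1 = n1 / (n1 + n2)
    w2 = n2 / (n1 + n2)
    p_bar = w1 * p1 + w2 * p2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # allele is fixed or absent in both populations
    # weighted mean of within-population expected heterozygosity
    h_within = w1 * 2 * p1 * (1 - p1) + w2 * 2 * p2 * (1 - p2)
    return (h_total - h_within) / h_total


def pairwise_fst(freqs):
    """Return Fst for every pair of populations in a {label: frequency} dict."""
    return {
        (a, b): fst_two_populations(freqs[a], freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


if __name__ == "__main__":
    # Illustrative allele frequencies only; not taken from the database.
    example = {"KOR": 0.30, "ASN": 0.32, "CEU": 0.10, "YRI": 0.65}
    for pair, value in pairwise_fst(example).items():
        print(pair, round(value, 4))
```

With four population groups, the pairwise comparison yields six Fst values per SNP.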

User interface and visualization 

Within EvoSNP-DB there are user interfaces for data queries 
and visualization. Three types of query can be applied: (i) SNP 
identifier, (ii) mRNA ID or gene symbol, and (iii) specific 
chromosome region. For example, rs28218 could be used for a 
SNP-based search, NM_002124 or ORF4F16 for a gene 
search, and chr1:157661000..157806000 for a 
chromosomal region search. 
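
The three query types can be told apart by their syntax alone. The sketch below shows one way a front end might route a search term; it is not the EvoSNP-DB implementation, and the function name, the regular expressions (including the assumption that mRNA IDs follow the RefSeq NM_ pattern), and the fallback to a gene-symbol search are all hypothetical.

```python
import re

# Assumed patterns; the actual server-side rules are not described in the text.
RSID = re.compile(r"^rs\d+$")
REFSEQ_MRNA = re.compile(r"^NM_\d+(\.\d+)?$")
REGION = re.compile(r"^(chr[0-9XYM]+):(\d+)\.\.(\d+)$")


def classify_query(query):
    """Return (query_type, parsed_value) for a search term."""
    term = query.strip()
    if RSID.match(term):
        return ("snp", term)
    if REFSEQ_MRNA.match(term):
        return ("gene", term)
    match = REGION.match(term)
    if match:
        chrom = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if start > end:
            raise ValueError("region start must not exceed region end")
        return ("region", (chrom, start, end))
    # Anything else is treated as a gene symbol.
    return ("gene", term)


if __name__ == "__main__":
    for term in ("rs28218", "NM_002124", "chr1:157661000..157806000", "TRIO"):
        print(term, "->", classify_query(term))
```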

Regardless of the query type, EvoSNP-DB returns three ta- 
bles providing Region, Gene, and SNP statistics (Fig. 2). Each 
table contains summary variation scores. Fig. 2 illustrates the 
output when rs28218 was used as a search term; scores of the 
gene TRIO, which contains this SNP, are summarized in the 
gene statistics table. JSP and GMOD (http://gmod.org) were 
used to build the table and figure interfaces. Links to public 
online databases, including Entrez Nucleotide, dbSNP, OMIM 
(22), and HapMap (20), are provided in EvoSNP-DB, together 
with the results (Fig. 2). EvoSNP-DB also offers a generic 
genome browser, which displays overviews of chromosomes, 
contigs, genes, mRNAs, and SNPs (23). Figs. 3 and 4 
demonstrate the output if small or large numbers of SNPs exist in the 
query region, respectively. 

Fig. 2. A screenshot of the result table from EvoSNP-DB. 

EvoSNP-DB provides an open-architecture website using a 
wiki interface for data access (a wiki is a website that allows its 
users to add, modify, or delete its content via a web browser), 
and wiki-based SNP annotation will be available in the near 
future. This will be particularly useful for constructing accurate 
and informative annotation for variants identified by the 
collaborative work of many researchers. MySQL, Python, JSP, and 
GBrowse were used in database construction, and to enhance 
interface utility (24). 

Fig. 3. A detailed screenshot showing EvoSNP-DB search results. 
Top track: chromosomal overview; SNP locations are shown as 
diamond shapes and OMIM disease associations as rectangles. 
Second track: VarLD scores visualized along a 2 Mb 
chromosomal region. Third track: allele frequencies of SNPs, 
visualized as a pie chart for the Korean population or as towers 
for HapMap populations. Bottom track: genes in the region. 

Fig. 4. A wide screenshot showing the search results with OMIM 
and GWAS Catalogue. Allele frequency is not displayed, but each 
SNP is indicated. 

MATERIALS AND METHODS 

Korean genotype data 

Previously, we conducted GWAS for two independent Korean 
population-based cohorts (Ansung and Ansan) as part of the 
Korean Genome Epidemiology Study (KoGES), and the Korean 
Association REsource (KARE) project, which was initiated in 
2007 (2, 4). In the Ansung area, we recruited additional family 
members of the original participants to facilitate family based 
association studies. Among these, 54 trios (162 samples) were 
genotyped using an Affymetrix Genome-Wide Human SNP 
Array 6.0 and an Illumina Human Omni1-Quad chip. Genotypes 
were called with Birdseed and BeadStudio GenCall for 
Affymetrix and Illumina arrays, respectively (25, 26). Initially, 
~1.9 million SNPs from the two platforms (909,622 for 
Affymetrix and 1,010,624 for the Illumina array) were merged. 
For quality control, we excluded SNPs meeting any of the following 
criteria: non-autosomal location, Mendelian errors, high missing 
genotype rate (> 5%), or deviation from HWE (P < 1E-6). Filtered 
SNPs were compared with HapMap SNPs with respect to 
allele, strand, and genomic position. After excluding 14 SNPs 
with annotation errors, 1,147,845 SNPs overlapped with 
HapMap SNPs (27). 
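
The exclusion criteria above amount to a simple per-SNP predicate. The following sketch applies them to a list of summary records; the record layout, field names, and SNP identifiers are hypothetical, and treating a single Mendelian error as grounds for exclusion is an assumption, since the text does not state a tolerance.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP quality-control summary (hypothetical record layout)."""
    snp_id: str
    chromosome: str
    mendelian_errors: int
    missing_rate: float
    hwe_p: float


AUTOSOMES = {str(i) for i in range(1, 23)}


def passes_qc(snp, max_missing=0.05, min_hwe_p=1e-6):
    """True if the SNP survives all four exclusion criteria."""
    return (
        snp.chromosome in AUTOSOMES          # exclude non-autosomal SNPs
        and snp.mendelian_errors == 0        # assumed: any error excludes
        and snp.missing_rate <= max_missing  # exclude missing rate > 5%
        and snp.hwe_p >= min_hwe_p           # exclude HWE P < 1E-6
    )


if __name__ == "__main__":
    # Made-up records, one per failure mode plus one that passes.
    records = [
        SnpQc("snp_ok", "1", 0, 0.01, 0.50),
        SnpQc("snp_chrX", "X", 0, 0.00, 0.90),
        SnpQc("snp_mendel", "2", 1, 0.00, 0.90),
        SnpQc("snp_missing", "3", 0, 0.06, 0.90),
        SnpQc("snp_hwe", "4", 0, 0.00, 1e-7),
    ]
    print([r.snp_id for r in records if passes_qc(r)])
```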

HapMap genotype data 

HapMap phase II release #22 data (210 samples) were 
downloaded. Genotype data were converted to the PLINK binary 
genotype format, and genotype frequencies, allele frequencies, 
and HWE P-values were calculated using PLINK (28). 
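
To make these per-SNP quantities concrete, the sketch below derives an allele frequency and an HWE P-value from genotype counts. It uses a one-degree-of-freedom chi-square approximation as a simple stand-in; PLINK itself reports an exact test, so values will differ, especially for rare alleles. The function name and the example counts are assumptions for illustration.

```python
import math


def allele_freq_and_hwe(n_aa, n_ab, n_bb):
    """Return (frequency of allele A, chi-square HWE P-value).

    n_aa, n_ab, n_bb: observed genotype counts for AA, AB, and BB.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    if p == 0.0 or p == 1.0:
        return p, 1.0  # monomorphic SNP: HWE is not testable
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, p_value


if __name__ == "__main__":
    # Illustrative genotype counts only.
    print(allele_freq_and_hwe(25, 50, 25))
    print(allele_freq_and_hwe(50, 0, 50))
```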

Analysis of genetic diversity among populations 

Fst and VarLD were used as population genetic diversity met- 
rics (13, 14). Fst was calculated for each SNP by a pairwise 
comparison of four populations. Genome-wide VarLD analysis 
was performed; VarLD scores were calculated for windows of 
50 SNPs, starting from the first SNP of each chromosome and 
ending with the last. All values from 22 chromosomes were 
merged and standardized (mean = 0, standard deviation = 1). 
VarLD analysis procedures were performed for all pairs of 
populations. To assess the degree of genetic difference between 
populations, we calculated the quartiles of Fst and VarLD score 
distributions. 
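
The windowing, standardization, and quartile steps can be sketched as follows, using only the Python standard library. This does not reproduce the varLD program itself, which computes the per-window score; the function names are hypothetical, and the one-SNP step between windows and the default quantile method of `statistics.quantiles` are assumptions not stated in the text.

```python
import statistics


def sliding_windows(n_snps, size=50, step=1):
    """Yield (start, end) index pairs for windows of `size` consecutive SNPs."""
    for start in range(0, n_snps - size + 1, step):
        yield start, start + size


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [(s - mean) / sd for s in scores]


def quartiles(scores):
    """Return the three cut points dividing the scores into quartiles."""
    return statistics.quantiles(scores, n=4)


if __name__ == "__main__":
    # Made-up raw window scores standing in for merged genome-wide values.
    raw = [0.8, 1.1, 0.9, 2.4, 1.7, 1.0, 3.2, 0.7]
    z = standardize(raw)
    print(list(sliding_windows(52)))
    print(quartiles(z))
```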